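Recently Prometheus started logging the error discovery=file msg="Error adding file watcher" err="too many open files". In addition, after the JSON target files used by Prometheus were updated, the changes were not picked up promptly and took a long time to take effect.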

The error output is as follows:

$ systemctl status prometheus.service -l
● prometheus.service - Prometheus Voice
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-06-03 13:45:32 +08; 1 day 3h ago
Docs: https://github.com/prometheus/prometheus
Main PID: 281503 (prometheus)
CGroup: /system.slice/prometheus.service
└─281503 /data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"

Taken literally, the error says that too many files are open, so the first step was to check the open files limits.

$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2056770
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1024000
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

$ cat /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Voice
Documentation=https://github.com/prometheus/prometheus
After=network.target
[Service]
WorkingDirectory=/data1/prometheus/prometheus-2.16.0.linux-amd64
ExecStart=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
ExecReload=/usr/bin/curl -X POST http://localhost:9090/-/reload
Restart=on-failure
RestartSec=5
Type=simple
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
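
Note that ulimit -a reports the limits of the current shell, not those of the running service. The following is a minimal sketch for reading what the Prometheus process actually received, assuming a Linux /proc filesystem. The helper names proc_nofile_limit and proc_open_fds are hypothetical and not part of any existing tool.

```shell
# Print the soft "Max open files" limit of a running process.
# $1 = PID, $2 = proc root (defaults to /proc; overridable for testing).
proc_nofile_limit() {
    awk '/^Max open files/ {print $4}' "${2:-/proc}/$1/limits"
}

# Count the file descriptors the process currently holds.
# $1 = PID, $2 = proc root (defaults to /proc; overridable for testing).
proc_open_fds() {
    ls "${2:-/proc}/$1/fd" | wc -l
}
```

For example, with the main PID from the status output above, `proc_nofile_limit 281503` should print the soft limit applied by systemd, and `proc_open_fds 281503` the number of descriptors in use. Reading another user's fd directory typically requires root.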

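The check showed that the open files limits were already generous: 1024000 for the shell and LimitNOFILE=65536 for the service. The file descriptor limit was therefore not the cause.
Further research showed that the real cause was the default value of fs.inotify.max_user_instances being too small. This kernel parameter caps the number of inotify instances that each real user ID can create, and it defaults to 128. When that cap is reached, creating a new inotify instance fails with EMFILE, whose message is also "too many open files".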

$ cat /proc/sys/fs/inotify/max_user_instances
128
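
To see how close the system is to this limit, one can count the inotify instances that are currently open. On Linux each instance appears as a file descriptor whose link target is anon_inode:inotify. The following is a rough sketch, assuming a find that supports -lname (GNU find does); the helper name inotify_instances_by_pid is hypothetical.

```shell
# Print "count pid" pairs for processes holding inotify instances,
# busiest first.
# $1 = proc root (defaults to /proc; overridable for testing).
inotify_instances_by_pid() {
    find "${1:-/proc}"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null |
        awk -F/ '{count[$(NF-2)]++} END {for (pid in count) print count[pid], pid}' |
        sort -rn
}
```

Run it as root so that processes of all users are visible. Because the limit applies per real user ID, the counts of all processes owned by the same user have to be added together before comparing against max_user_instances.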

$ tail -1 /etc/sysctl.conf
fs.inotify.max_user_instances=65000
$ sysctl -p
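
Editing /etc/sysctl.conf by hand works, but repeating it can leave duplicate entries. The following is a small sketch of an idempotent variant, assuming GNU sed; the helper name set_sysctl_conf is hypothetical, and the value 65000 is simply the one used above.

```shell
# Set KEY=VALUE in a sysctl-style config file: replace an existing entry
# for KEY, or append one if there is none.
# $1 = key, $2 = value, $3 = config file (defaults to /etc/sysctl.conf).
# The body runs in a subshell so its variables do not leak into the caller.
set_sysctl_conf() (
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    # Escape the dots so they match literally in the regular expressions.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i -E "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
)
```

As root, `set_sysctl_conf fs.inotify.max_user_instances 65000` followed by `sysctl -p` would reproduce the change shown above. To apply the value immediately without persisting it, `sysctl -w fs.inotify.max_user_instances=65000` can be used instead.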

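Problem solved. After the change, the log no longer shows new file watcher errors and configuration reloads complete normally: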

$ systemctl status prometheus.service -l
● prometheus.service - Prometheus Voice
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-06-03 13:45:32 +08; 1 day 3h ago
Docs: https://github.com/prometheus/prometheus
Main PID: 281503 (prometheus)
CGroup: /system.slice/prometheus.service
└─281503 /data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 17:00:04 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:04.829Z caller=main.go:747 msg="Loading configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:04 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:04.864Z caller=main.go:775 msg="Completed loading of configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:29 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:29.603Z caller=main.go:747 msg="Loading configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:29 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:29.634Z caller=main.go:775 msg="Completed loading of configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:03:07 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:03:07.125Z caller=compact.go:496 component=tsdb msg="write block" mint=1591250400000 maxt=1591257600000 ulid=01E9Z8XZN7M530V866W4WFC3Y6 duration=3m7.085617933s
Jun 04 17:03:18 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:03:18.535Z caller=head.go:661 component=tsdb msg="head GC completed" duration=7.757127512s

References:
https://blog.csdn.net/weiguang1017/article/details/54381439
https://groups.google.com/forum/#!topic/prometheus-users/OQzEYeggnpw


This article is from "Jack Wang Blog": http://www.yfshare.vip/2022/11/27/解决Prometheus too many open files问题/